Development of the method for filtering verbal noise while search keywords for the English text

Oleg Bisikalo; Alexander Yahimovich; Yaroslav Yahimovich

doi:10.15587/2312-8372.2018.149962

Authors

Oleg Bisikalo, Vinnitsa National Technical University, 95, Khmelnytske shose str., Vinnitsa, Ukraine, 21021, https://orcid.org/0000-0002-7607-1943
Alexander Yahimovich, Vinnitsa National Technical University, 95, Khmelnytske shose str., Vinnitsa, Ukraine, 21021, https://orcid.org/0000-0001-6960-5823
Yaroslav Yahimovich, Vinnitsa National Technical University, 95, Khmelnytske shose str., Vinnitsa, Ukraine, 21021, https://orcid.org/0000-0003-2101-2791

Keywords:

verbal noise filtering, English text keywords, linguistic package, DKPro Core, syntactic analysis

Abstract

The object of research is the processing of verbal information to identify keywords in a text. The most important step in the search for key terms is calculating their weights in the document in question, which makes it possible to evaluate their significance relative to each other in that context. The many approaches to this problem are conventionally divided into two groups: those that require training and those that do not. Training implies pre-processing a source corpus of texts in order to extract information about how frequently terms occur across the whole corpus. An alternative approach uses linguistic ontologies, which are more or less approximate models of the existing set of words in a given language. Systems for the automatic extraction of key terms are built on both approaches. Nevertheless, research on keyword search continues, both to improve the accuracy and completeness of the results and to apply methods of extracting information from text to new problems.
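As an illustration of a training-based weighting, the sketch below computes TF-IDF weights, a common scheme that derives term frequencies from a whole corpus. The abstract does not name the weighting it has in mind, so the choice of TF-IDF, the function name `tf_idf`, and the toy corpus are all assumptions made for this example.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {term: weight} dict per document.

    `corpus` is a list of documents, each a list of tokens.
    This is plain TF-IDF, used here only as an assumed example
    of a weighting that needs corpus-wide frequency statistics.
    """
    n_docs = len(corpus)

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for doc in corpus:
        df.update(set(doc))

    weights = []
    for doc in corpus:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            # Term frequency in this document times inverse document frequency.
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy corpus (hypothetical data, already tokenized and lower-cased).
corpus = [
    "keyword extraction from english text".split(),
    "noise filtering for english text".split(),
    "syntactic analysis of sentences".split(),
]
weights = tf_idf(corpus)
```

In this toy corpus, a term that occurs in a single document (such as "keyword") receives a higher weight than one shared by two documents (such as "english"), which is the relative ranking of terms that the weighting step is meant to provide.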

Existing approaches to the definition of keywords are characterized. The best text processing quality is achieved by linguistic methods or by combining them with statistical ones. A system for automatically determining key phrases from natural language text should therefore be developed using a morphological dictionary and syntax rules.

The study uses an approach to determining keywords based on finding syntactic links between word forms in the sentences of an English text, using the instrumental capabilities of modern linguistic packages. Within the general approach to reducing verbal noise, the method proposes achieving this through four formalized operations: replacing pronouns with the corresponding nouns; removing noise links; removing noise words; and removing stop words. The described operations can be used as additional modules that improve keyword search results, both for the developed method for determining keywords of English text and for other keyword search algorithms.
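The four operations can be pictured as a chain of filters over the syntactic links of a parsed text. The following is a minimal sketch under stated assumptions, not the authors' implementation: the `Link` structure, the function names, the hand-written parse, the pronoun-to-noun map, and the contents of `NOISE_RELATIONS`, `NOISE_POS`, and `STOP_WORDS` are all hypothetical, since the abstract does not list them. In the study itself, the syntactic links come from a linguistic package (DKPro Core is named in the keywords) rather than from hand-written data.

```python
from dataclasses import dataclass, replace

# Illustrative sets only; the abstract does not give the actual lists.
NOISE_RELATIONS = {"det", "expl", "punct", "root"}  # assumed noise links
NOISE_POS = {"DT", "IN", "CC", "TO"}                # assumed noise word tags
STOP_WORDS = {"is", "are", "very", "the"}           # assumed stop words


@dataclass(frozen=True)
class Link:
    """One syntactic link between two word forms (hypothetical structure)."""
    head: str
    dependent: str
    relation: str
    head_pos: str
    dependent_pos: str


def replace_pronouns(links, antecedents):
    """Operation 1: replace each resolved pronoun with its noun."""
    def resolve(word, pos):
        if pos == "PRP" and word.lower() in antecedents:
            return antecedents[word.lower()], "NN"
        return word, pos

    resolved = []
    for link in links:
        head, head_pos = resolve(link.head, link.head_pos)
        dep, dep_pos = resolve(link.dependent, link.dependent_pos)
        resolved.append(replace(link, head=head, head_pos=head_pos,
                                dependent=dep, dependent_pos=dep_pos))
    return resolved


def remove_noise_links(links):
    """Operation 2: drop links whose relation type is treated as noise."""
    return [l for l in links if l.relation not in NOISE_RELATIONS]


def remove_noise_words(links):
    """Operation 3: drop links that touch a word with a noise POS tag."""
    return [l for l in links
            if l.head_pos not in NOISE_POS and l.dependent_pos not in NOISE_POS]


def remove_stop_words(links):
    """Operation 4: drop links that touch a stop word."""
    return [l for l in links
            if l.head.lower() not in STOP_WORDS
            and l.dependent.lower() not in STOP_WORDS]


def filter_verbal_noise(links, antecedents):
    """Apply the four operations in the order the abstract lists them."""
    links = replace_pronouns(links, antecedents)
    links = remove_noise_links(links)
    links = remove_noise_words(links)
    return remove_stop_words(links)


# Hand-written links for "The parser reads text from files. It is very fast."
links = [
    Link("parser", "The", "det", "NN", "DT"),
    Link("reads", "parser", "nsubj", "VBZ", "NN"),
    Link("reads", "text", "dobj", "VBZ", "NN"),
    Link("files", "from", "case", "NNS", "IN"),
    Link("reads", "files", "nmod", "VBZ", "NNS"),
    Link("fast", "It", "nsubj", "JJ", "PRP"),
    Link("fast", "is", "cop", "JJ", "VBZ"),
    Link("fast", "very", "advmod", "JJ", "RB"),
]
kept = filter_verbal_noise(links, antecedents={"it": "parser"})
```

Because each operation takes a list of links and returns a list of links, any one of them can be dropped or reused on its own in front of a different keyword search algorithm, which matches the modular use the abstract describes.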

Author Biographies

Oleg Bisikalo, Vinnitsa National Technical University, 95, Khmelnytske shose str., Vinnitsa, Ukraine, 21021

Doctor of Technical Sciences, Professor

Department of Automation and Computer-Integrated Technologies

Alexander Yahimovich, Vinnitsa National Technical University, 95, Khmelnytske shose str., Vinnitsa, Ukraine, 21021

Postgraduate Student

Department of Automation and Computer-Integrated Technologies

Yaroslav Yahimovich, Vinnitsa National Technical University, 95, Khmelnytske shose str., Vinnitsa, Ukraine, 21021

Postgraduate Student

Department of Electronics and Nanosystems

